Abstract

Based on the concept of the energy landscape, a picture of the mismatch between
the reduced interaction matrix of residues and the matrix of statistical contact
potentials is presented. For the Miyazawa and Jernigan (MJ) matrix, rational
groupings of the 20 kinds of residues with minimal mismatch are studied, taking
into account local minima and the statistics of correlations between the
residues. A hierarchical tree of groupings for different numbers of groups N is
obtained, and a plateau around N = 8 ~ 10 is found, which may represent the basic
degree of freedom of the sequence complexity of proteins.
Reducing the sequence complexity of proteins by using a small set of amino acid residues
has been pursued both theoretically [1] and experimentally [?]. Some patterns of residues were
discovered in the reconstruction of secondary structures, such as binary patterns in α-helices
and helix bundles (see review [3] and references therein). These experiments imply
that not only the hydrophobic cores and the native structures but also the rapid folding
behavior of proteins can be realized with simplified alphabets of residues. These
findings suggest the existence of some small sets of residues that characterize the diversity
of protein sequences. Theoretically, the simplest reduction, the so-called HP model with an
H group of hydrophobic residues and a P group of polar residues, has been extensively
used. Yet the relation between the different forms or levels of these reductions (such as the
5-letter palette [?], or different forms of HP groupings [?]) and the original sequences is
not generally established. Finding the physical origin of these reductions is important for the
reduced representation of proteins.

Previously, based on the Miyazawa and Jernigan (MJ) matrix of contact potentials of
residues [7], we made reductions by partitioning the residues into groups [?]. We found
possible simplified schemes by minimizing the mismatch between the reduced interaction matrix
and the original MJ matrix. Here we report a physical picture of the mismatch based on the
concept of the energy landscape, together with some rational groupings. Statistics on the
correlation between residues show that some residues tend to aggregate, i.e., they are "friends"
that prefer to stay in the same group. This enables us to settle the groupings. A plateau of the
mismatch around group number N = 8 ~ 10 is found for three different interaction matrices,
implying that groupings with N = 8 ~ 10 may provide a rational set for protein reduction. This
coincides with the fact that proteins generally include more than 7 types of residues.

To divide the 20 types of residues into a number of groups, the basic principle may be that
the residues in a group should be similar in their physical aspects, mainly their interactions.
After grouping, the residues in a group can be represented by one of the residues belonging to
the group, and thus the complexity of the protein sequence is reduced. When a residue is replaced
by another, the energy landscape of a protein should not change its main feature (the shape),
i.e., the folding features should remain basically the same. This is the case especially when the
system is near the bottom of the funnel, where a protein has the most compact conformations.
The energy difference between two nearby conformations (c1) and (c2) is defined as
$\Delta E = \sum_n [e_n(s_i, s_j) - e_n(s_k, s_l)]$, where $e_n$ is the contact energy of contact
$n$ between two residues, $s_i$ is the residue type of the $i$-th element in the protein sequence,
and the numbers of contacts in the two conformations are assumed to be the same. Keeping the
main feature of the energy landscape means that $\Delta E$ should not change its sign, i.e.,

$$\mathrm{sign}[\Delta E_{\mathrm{new}}] = \mathrm{sign}[\Delta E_{\mathrm{old}}] \qquad (1)$$

when a residue $s_g$ ($g = i, j, k$ or $l$) in $\Delta E$ is substituted by one of its 'friends'
$s'_g$ in the same group. Any violation of Eq. (1) may change the energy landscape, and a quantity
called the "mismatch" is introduced to characterize the discrepancy between the original protein
and its substitute. The mismatch thus quantifies how poorly the substitutions of residues fit
for a certain grouping.
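
The sign condition of Eq. (1) can be illustrated with a minimal Python sketch. The 3×3 contact-energy matrix `E_TOY`, the function names, and the choice to replace every occurrence of a residue type at once are illustrative assumptions, not values or procedures taken from this work.

```python
import numpy as np

# Hypothetical toy contact energies for three residue types (not MJ values):
# types 0 and 1 interact similarly, type 2 is clearly different.
E_TOY = np.array([[-5.0, -4.8, -1.0],
                  [-4.8, -4.6, -0.9],
                  [-1.0, -0.9, -0.2]])

def delta_e(e, contacts_1, contacts_2):
    """Sum over contacts n of e(s_i, s_j) - e(s_k, s_l) for two conformations."""
    if len(contacts_1) != len(contacts_2):
        raise ValueError("both conformations must have the same number of contacts")
    return sum(e[i, j] - e[k, l]
               for (i, j), (k, l) in zip(contacts_1, contacts_2))

def sign_preserved(e, contacts_1, contacts_2, old, new):
    """Return True if replacing residue type `old` by `new` keeps sign(dE), as in Eq. (1)."""
    def substitute(contacts):
        return [tuple(new if s == old else s for s in pair) for pair in contacts]
    before = np.sign(delta_e(e, contacts_1, contacts_2))
    after = np.sign(delta_e(e, substitute(contacts_1), substitute(contacts_2)))
    return bool(before == after)

# One contact per conformation: (0, 0) in c1 versus (0, 2) in c2.
c1, c2 = [(0, 0)], [(0, 2)]
friend_ok = sign_preserved(E_TOY, c1, c2, old=0, new=1)      # similar residue
non_friend_ok = sign_preserved(E_TOY, c1, c2, old=0, new=2)  # dissimilar residue
```

In this toy case the substitution by the similar type 1 keeps the sign of the energy difference, while the substitution by the dissimilar type 2 does not.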

In detail, the 20 natural residues are partitioned into $N$ groups $G_1, \dots, G_N$ (groups
A, B and so on in Ref. [?]) with $n_i$ residues in group $G_i$, where $n_1 + n_2 + \dots + n_N = 20$.
Different values of $n_i$ give different "sets" of the partition, and different arrangements of
residues into a given set represent different "distributions" of the residues. The groups
for a certain group number $N$ are represented as $Q_N = \{\{G_K^l(N), K = 1, N\}, l = 1, L_N\}$,
where $G_K^l(N)$ means the $K$-th group in the $l$-th set among the total of $L_N$ sets [?]. For a
certain set, the mismatch will be minimized if the residues in each of the $N$ groups are friends.
[Residues which do not finally aggregate in a group are not friends.] Due to
the arbitrariness of the contact index and the various possible distributions of residues, we
define a strong requirement for a successful grouping: no change of sign for a substitution in
$\Delta E$, i.e., $\lambda(s_i s_j s_k s_l) = \mathrm{sign}[e(s_i, s_j) - e(s_k, s_l)]$ equals
$\lambda(s'_i s_j s_k s_l) = \mathrm{sign}[e(s'_i, s_j) - e(s_k, s_l)]$,
e.g., when $s_i$ is substituted by one of its friends $s'_i$. Here $s_i$, $s_j$, $s_k$ and $s_l$
belong to groups $G_\alpha$, $G_\beta$, $G_\gamma$ and $G_\delta$ with
$\alpha, \beta, \gamma, \delta \in \{1, 2, \dots, N\}$, respectively. Generally, when a residue is
substituted by another residue (friend or non-friend) from the same group, whether the grouping
is well done or not, one always has $\lambda(s'_i s_j s_k s_l) = 1$, $0$ or $-1$. Then, all
possible substitutions give a sum of the related values of $\lambda$, i.e.,
$\Lambda(G_\alpha, G_\beta, G_\gamma, G_\delta) = \sum \lambda(s_i s_j s_k s_l)$, with the sum
taken over $s_i \in G_\alpha$, $s_j \in G_\beta$, $s_k \in G_\gamma$ and $s_l \in G_\delta$, which
describes the total effect of substitutions of the residues from the four groups $G_\alpha$,
$G_\beta$, $G_\gamma$ and $G_\delta$. If $\lambda(s'_i s_j s_k s_l)$ is not the same as
$\mathrm{sign}[\Lambda]$ (obviously $\mathrm{sign}[\Lambda] = P$ in Ref. [8]), the substitution
$s_i \to s'_i$ is not favorable, i.e., putting the residues $s_i$ and $s'_i$ in one group is
a mismatched grouping. The average over all residues gives the total mismatch $M_{ab}$ of this
distribution. For the detailed form of the mismatch see Ref. [8].
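
As a rough illustration of how the signs λ and their group-level sum Λ can be turned into a number, the following Python sketch computes a simplified mismatch. It is not the detailed form from the cited reference: the function `toy_mismatch`, the rule that a vanishing Λ gives no group-level preference, and the 4-residue energy matrix `E_TOY4` are all assumptions made here for illustration.

```python
import itertools
import numpy as np

def toy_mismatch(e, groups):
    """Fraction of residue quadruples whose sign differs from the group-level sign."""
    bad = total = 0
    for ga, gb, gc, gd in itertools.product(groups, repeat=4):
        # lambda = sign[e(s_i, s_j) - e(s_k, s_l)] for every choice of residues
        lam = np.array([np.sign(e[i, j] - e[k, l])
                        for i, j, k, l in itertools.product(ga, gb, gc, gd)])
        group_sign = np.sign(lam.sum())  # sign of Lambda for these four groups
        total += lam.size
        if group_sign != 0:  # assumption: a tie gives no group-level preference
            bad += int(np.sum(lam != group_sign))
    return bad / total

# Hypothetical energies e(a, b) = -h[a] * h[b]; types 0, 1 are "H-like", 2, 3 "P-like".
h = np.array([2.0, 1.9, 0.5, 0.4])
E_TOY4 = -np.outer(h, h)

m_similar = toy_mismatch(E_TOY4, [(0, 1), (2, 3)])  # similar residues together
m_mixed = toy_mismatch(E_TOY4, [(0, 2), (1, 3)])    # dissimilar residues together
```

With these toy energies, grouping the similar residues together gives a lower value than mixing dissimilar residues, which is the qualitative behavior the mismatch is meant to capture.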

When the number of elements $n_i$ in each group is fixed, different distributions of residues
among the groups may result in different mismatches. Among all the distributions, the best
distribution (or the best arrangement of the residues) gives a minimal mismatch $M_{ab}^{\min}$
for a certain set $(n_1, n_2, \dots, n_N)$. To find $M_{ab}^{\min}$, a Monte Carlo (MC) minimization
procedure [?] is used. An enumeration over all possible distributions of residues can also be
made for small N. With a fixed group number N, we have a number of different sets which
give different minimal mismatches $M_{ab}^{\min}$. In principle, for a certain group number N, we
could choose the lowest mismatch among all $L_N$ sets and take the related grouping as the final
result. However, this is problematic for sets with many groups with a single element
(MGWSE), i.e., groups with singlets. For example, as shown in Fig. 1, for the set (1, 19) the
mismatch is the lowest among all 10 sets (and likewise for the set (1, 1, 1, 1, 16) for N = 5, and
so on; see Fig. 5). Obviously, this kind of mismatch does not relate to the best or rational
grouping of the residues. Therefore, we must take a local minimum (or plateau) among all
sets as the rational global minimum $M_g$ (see Fig. 1). Such "locality" is motivated by the
similarity between two groupings. Two groupings are regarded as neighbors when
they can be transformed into each other just by exchanging two residues between two groups or by
moving one residue from one group to another. With this, all local minima (or plateaus)
are identified and analyzed. As shown in Fig. 1, there is clearly a local minimum (or a
plateau) besides those with MGWSE. Generally, different minima have different grouping
patterns, as indicated in Fig. 1. These local minima and plateaus may represent some better
groupings, and may reflect some intrinsic affinity between the residues. As a result, they are
taken as the corresponding rational groupings with mismatch $M_g$. It is worth noting that
for groupings under some restrictions, such as keeping the hydrophobic group unchanged
[?], the picture of the minimal groupings is the same although the grouping space is limited.
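
The notions of "sets" and of neighboring groupings can be sketched in Python as follows. The function names are hypothetical, and the cost passed to `is_local_minimum` is a placeholder for whatever mismatch function is used; this is only one possible reading of the neighbor definition given above, not the procedure actually implemented for the results reported here.

```python
import itertools

def enumerate_sets(total, n_groups, smallest=1):
    """Yield all sets (n_1 <= ... <= n_N) of positive group sizes summing to `total`."""
    if n_groups == 1:
        if total >= smallest:
            yield (total,)
        return
    for first in range(smallest, total // n_groups + 1):
        for rest in enumerate_sets(total - first, n_groups - 1, first):
            yield (first,) + rest

def neighbors(grouping):
    """Groupings reachable by moving one residue or exchanging two residues."""
    groups = [frozenset(g) for g in grouping]
    original = frozenset(groups)
    found = set()
    for a, b in itertools.permutations(range(len(groups)), 2):
        ga, gb = groups[a], groups[b]
        if len(ga) > 1:  # move one residue from group a to group b
            for r in ga:
                new = list(groups)
                new[a], new[b] = ga - {r}, gb | {r}
                found.add(frozenset(new))
        if a < b:  # exchange one residue of group a with one of group b
            for r, s in itertools.product(ga, gb):
                new = list(groups)
                new[a], new[b] = (ga - {r}) | {s}, (gb - {s}) | {r}
                found.add(frozenset(new))
    found.discard(original)  # an exchange between two singlets changes nothing
    return found

def is_local_minimum(grouping, cost):
    """True if no neighboring grouping has a strictly lower cost."""
    here = cost(frozenset(frozenset(g) for g in grouping))
    return all(cost(g) >= here for g in neighbors(grouping))

sets_n2 = list(enumerate_sets(20, 2))  # (1, 19), (2, 18), ..., (10, 10)
```

For N = 2 the enumeration reproduces the 10 sets mentioned in the text, starting with (1, 19). Because a move changes the group sizes, the neighborhood connects different sets, which is what allows a local minimum "among all sets" to be defined.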

The aggregation of some friendly residues into a group implies some correlation between
these residues. First let us consider the two-residue correlation $C(S_i, S_j)$, obtained by
counting the number of groups which include both residues $S_i$ and $S_j$. That is, the count is
taken as a quantitative scale of the affinity between two residues, or a probability of two
residues being in the same group, among all groups in $Q_N$ for a certain N, i.e., the groups for
all $L_N$ sets with their related $M_{ab}^{\min}$'s. Here the count $C$ is defined as

$$C(S_i, S_j) = \sum_{K=1}^{N} \sum_{l=1}^{L_N} I(S_i, G_K^l(N)) \times I(S_j, G_K^l(N)) \qquad (2)$$

where $I(S, G) = 1$ when $S \in G$, or zero when $S \notin G$. Clearly, a matrix of the correlation
between all possible pairs of residues $C(S_i, S_j)$ can be obtained (see Fig. 2). It is found
that the counts for some pairs are much larger than those for other pairs. This means that
some residues are friends while others repel each other, reflecting an effective
"attraction" between the residues in a group and "repulsion" between residues in different
groups. Note that the groupings for different N show similar patterns. The probability
of finding a certain group $G$ with specified residues among all the minimal groups $Q_N$ can
also be obtained from a count $C'(G) = \sum_{K=1}^{N} \sum_{l=1}^{L_N} D(G, G_K^l(N))$, where
$D(G_1, G_2) = 1$ when $G_1 = G_2$, or 0 for $G_1 \neq G_2$. As expected, different groups have
different chances to appear (see Fig. 3). These differences result not only from the grouping
affinity between residues but also from the preference for groups of a certain size. For
comparison, the count $C'(G)$ is normalized by the total number of groups with the same size as
group $G$ in the statistical set $Q_N$. This normalized count is taken as the probability of
occurrence of group $G$,

$$P(G) = C'(G) \Big/ \Big\{ \sum_{K=1}^{N} \sum_{l=1}^{L_N} \delta(\mathrm{size}(G), \mathrm{size}(G_K^l(N))) \Big\}, \qquad (3)$$

where $\mathrm{size}(G)$ is the number of residues in group $G$, and
$\delta(\mathrm{size}_1, \mathrm{size}_2)$ is the $\delta$-function. From Eq. (3) it is found that
some groups have large probabilities $P(G)$ and appear many times, with large counts $C'(G)$,
implying that the residues in these groups have a greater chance of being in the same group, or
that these groups have a strong preference to appear in the grouping. Thus a grouping with these
groups gives a better settlement of the 20 kinds of residues than others. Note that there are
some groups with high probabilities $P(G)$ but small statistical counts $C'(G)$. Such groups
generally have large numbers of elements and appear only once or twice in $Q_N$, which makes the
normalization factor in Eq. (3) rather small. These groups are removed from our analysis because
they lack statistical reliability.
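
The counts of Eqs. (2) and (3) amount to simple bookkeeping over the minimal groupings of all sets. The Python sketch below shows one way to do this; the function names and the two toy groupings of four residues are hypothetical and serve only to show the counting, not to reproduce any data from this work.

```python
from collections import Counter
from itertools import combinations

def pair_counts(groupings):
    """Eq. (2): number of groups, over all sets, that contain both residues of a pair."""
    counts = Counter()
    for grouping in groupings:
        for group in grouping:
            for a, b in combinations(sorted(group), 2):
                counts[(a, b)] += 1
    return counts

def group_probabilities(groupings):
    """Eq. (3): map each group to (count C'(G), count / number of same-size groups)."""
    occurrences = Counter(frozenset(g) for grouping in groupings for g in grouping)
    size_totals = Counter(len(g) for grouping in groupings for g in grouping)
    return {g: (c, c / size_totals[len(g)]) for g, c in occurrences.items()}

# Hypothetical minimal groupings for N = 2, one per set, over four residues only.
toy_groupings = [
    [("I", "L"), ("D", "E")],   # set (2, 2)
    [("I", "L", "D"), ("E",)],  # set (1, 3)
]

pairs = pair_counts(toy_groupings)
probs = group_probabilities(toy_groupings)
```

In this toy example the group (I, L, D) obtains probability 1 from a single occurrence, because it is the only group of size three. This is the situation described above in which a high P(G) with a small count C'(G) is not statistically reliable.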

For the MJ matrix, as shown in Fig. 4, the groupings follow a hierarchical tree-like
structure. That is, the 20 kinds of residues are first divided into two groups (see also Fig. 1a),
namely the H group with residues (C, M, F, I, L, V, W, Y) and the P group with residues
(A, G, T, S, N, Q, D, E, H, R, K, P). Then the H and P groups are alternately broken
into two or more groups for different N. For example, for N = 3, the
P group is divided into two smaller groups, i.e., (A, H, T) and (G, S, N, Q, D, E, R, K,
P). For N = 5, the H group is divided into (F, I, L) and (C, M, V, W, Y), and
the P group is divided into (A, H, T), (D, E, K) and (G, S, N, Q, R, P).
Similar results are obtained for N up to 9, with a sequential order of hydrophobicity and without
any overlap between the hydrophobic branch and the hydrophilic one following the H/P
division. The difference between the present study and the previous one in Ref. [8] is that
the H and P groups are broken alternately in the new groupings, which gives a
slight decrease in the mismatches and slightly different representative residues.
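
The tree-like property can be checked directly on the groupings quoted above for N = 2, 3 and 5. The short Python sketch below does so; the helper names are hypothetical, and only the groupings themselves are taken from the text.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Groupings quoted in the text for the MJ matrix, one string per group.
GROUPINGS = {
    2: ["CMFILVWY", "AGTSNQDEHRKP"],
    3: ["CMFILVWY", "AHT", "GSNQDERKP"],
    5: ["FIL", "CMVWY", "AHT", "DEK", "GSNQRP"],
}

def is_partition(groups):
    """True if the groups are disjoint and together cover all 20 residues."""
    members = [r for g in groups for r in g]
    return len(members) == 20 and set(members) == AMINO_ACIDS

def refines(fine, coarse):
    """True if every group of `fine` lies inside a single group of `coarse`."""
    return all(any(set(f) <= set(c) for c in coarse) for f in fine)

all_partitions = all(is_partition(g) for g in GROUPINGS.values())
tree_like = (refines(GROUPINGS[3], GROUPINGS[2])
             and refines(GROUPINGS[5], GROUPINGS[3]))
```

Both checks succeed for the quoted groupings: each is a complete partition of the 20 residues, and each finer grouping is obtained by splitting groups of the coarser one without mixing the H and P branches.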

Our new analysis provides a clearer physical picture of the rational groupings. Following
the tree-like groupings, one can see the divisions of the H groups or the P groups (see Fig. 4).
For N = 3, dividing the P group (starting from N = 2) is clearly more
rational than dividing the H group, suggesting a priority for dividing the P group first.
In contrast, for N = 4 we should divide the H group first, and then for
N = 5 divide the P group again. The division is thus found to be alternating, reflecting
the detailed differences between the interactions of the H and P groups. The former results
obtained under some restrictions, such as keeping the H group (with 8 residues) unchanged, may
correspond to somewhat rough divisions, resulting in large mismatches (see the data for N = 3, 4,
and 5 in Ref. [8] and Fig. 5).

Fig. 5 shows a monotonic decrease in the mismatch for N = 2 to 20, which implies that the
more groups the better. In addition, there is a plateau near N = 8 in this curve (case-A),
which characterizes the saturation of the grouping. This means that more groups will not
decrease the mismatch much further, i.e., more groups might not greatly enhance the efficiency of
the complexity reduction. Thus the number N = 8 may indicate the minimal number of types of
residues needed to reconstruct natural proteins, or a basic degree of freedom of the complexity
for protein representation. This, in some sense, relates well to the results of previous
studies [?] and an argument in Ref. [?]. Note that the former plateau at N = 5 disappears
once the grouping restriction is lifted. Interestingly, in Fig. 5 we also plot all the
lowest mismatches relating to the groupings with MGWSE, which generally are not local
minima as discussed above. A typical example is the grouping with set (1, 1, 1, 1, 16)
with a mismatch M = 0.04747, which is the lowest among all sets for N = 5. However,
even when all these trivial cases for N = 2 to 14 are included, the curve still shows a
plateau around N = 9, with eight single-residue groups of C, M, F, I, L, V, W and Y and
one group with the remaining residues. Clearly, this plateau relates again to the saturation
of the H and P grouping, or to the detailed differences between the interactions of the residues,
and also supports the discussion of the N = 8 plateau above. As shown in
Fig. 5(b), we have similar results for two other interaction matrices.

Finally, we note that for each grouping with different N, we have found the representative
residues for the MJ matrix, e.g., (I, A, D) for N = 3, (I, A, C, D) for N = 4 and (I, A, G,
E, C) for N = 5. The slight change in the representative residues for N = 5 is attributed
to the different implementation and requirement in the grouping. The foldability and the
effectiveness have also been studied; we will report on these elsewhere.
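
To make the use of representative residues concrete, the Python sketch below rewrites a sequence in the reduced N = 5 alphabet. The pairing of each representative with a group is inferred here from group membership (each of I, A, G, E, C belongs to exactly one of the five groups quoted for N = 5), and the example sequence and function name are hypothetical.

```python
# Representative residue -> members of its group (N = 5 grouping for the MJ matrix).
# The pairing is an assumption inferred from which group contains each representative.
REPRESENTATIVES_N5 = {
    "I": "FIL",
    "C": "CMVWY",
    "A": "AHT",
    "E": "DEK",
    "G": "GSNQRP",
}

def reduce_sequence(sequence, representatives):
    """Replace every residue by the representative of the group it belongs to."""
    table = {res: rep for rep, members in representatives.items() for res in members}
    return "".join(table[res] for res in sequence)

# Hypothetical 10-residue fragment, used only to show the mapping.
reduced = reduce_sequence("MKTAYIAKQR", REPRESENTATIVES_N5)
```

The reduced sequence uses only five letters, one per group, which is the sense in which the sequence complexity is lowered.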

In conclusion, we present a grouping method based on the requirement that the energy
landscape is basically preserved in the reduction. A quantity, the mismatch, is taken as the
measure of the quality of the reduction. Our results imply that the residues do have
some similarities in their interaction properties and can be put together into groups. Then,
by choosing a single residue for each group, the complexity of proteins can be reduced, i.e.,
the proteins can be represented with reduced compositions. In particular, a basic degree of
freedom of the complexity with 8 ~ 10 types of residues is found.

This work was supported by the Foundation of NNSF (No.19625409, 10074030). We 
thank C. Tang, C.H. Lee and H. S. Chan for comments and suggestions. 
FIGURES

FIG. 1. The mismatch $M_{ab}^{\min}$ for different sets for N = 2 (a) and N = 3 (b). The set index
represents the sets marked in the figure.

FIG. 2. Two-residue correlation statistics from Eq. (2) for the MJ matrix. Different shades of
gray represent different values of the count $C(S_i, S_j)$ among all 527 groups for N = 5.

FIG. 3. Probabilities P(G), i.e., Eq.(3), and the counts C'(G) of occurrence of groups G with 
N = 5. The group index is arranged following the magnitude of the probability of the groups. 
Some groups are labeled. 

FIG. 4. The rational groupings, forming a hierarchical tree-like structure, for the MJ matrix
for N up to 9.

FIG. 5. The minimal mismatch $M_g$ vs. N: (a) for the MJ matrix; (b) for contact potentials
in Ref. [6] (TD case) and in Ref. [9] (SW case). The plateaus are shown for different cases.